Introduction

This notebook explores linear regression techniques taught in the Coursera Machine Learning specialization.

# Load the required packages
library(tidyverse)
library(here)
library(janitor)

Load data

We use the Perth House Prices dataset from Kaggle.

# Load the data
house_prices <- 
  read_csv(here("data/src/all_perth_310121.csv"))

# View the head
head(house_prices)
## # A tibble: 6 × 19
##   ADDRESS           SUBURB  PRICE BEDROOMS BATHROOMS GARAGE LAND_AREA FLOOR_AREA
##   <chr>             <chr>   <dbl>    <dbl>     <dbl> <chr>      <dbl>      <dbl>
## 1 1 Acorn Place     South… 565000        4         2 2            600        160
## 2 1 Addis Way       Wandi  365000        3         2 2            351        139
## 3 1 Ainsley Court   Camil… 287000        3         1 1            719         86
## 4 1 Albert Street   Belle… 255000        2         1 2            651         59
## 5 1 Aman Place      Lockr… 325000        4         1 2            466        131
## 6 1 Amethyst Cresc… Mount… 409000        4         2 1            759        118
## # ℹ 11 more variables: BUILD_YEAR <chr>, CBD_DIST <dbl>, NEAREST_STN <chr>,
## #   NEAREST_STN_DIST <dbl>, DATE_SOLD <chr>, POSTCODE <dbl>, LATITUDE <dbl>,
## #   LONGITUDE <dbl>, NEAREST_SCH <chr>, NEAREST_SCH_DIST <dbl>,
## #   NEAREST_SCH_RANK <dbl>

Clean data

We will perform the following transformations to produce a clean dataset to work with for this project:

  • Convert column names to lowercase snake_case using the janitor::clean_names() function
  • Change GARAGE, BUILD_YEAR from character to numeric
  • Split DATE_SOLD into separate columns for year and month
  • Change POSTCODE from numeric to character
  • Keep only sales from a single year, so that inflation does not add too much “noise” to the pricing data
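
The steps listed above can be sketched end to end on a few made-up rows. This is a minimal illustration, not the notebook's actual cleaning code: the toy values, the "MM-YYYY" format assumed for DATE_SOLD, the new column names month_sold and year_sold, and the choice of 2019 as the year to keep are all assumptions.

```r
library(tidyverse)
library(janitor)

# Toy rows standing in for the real file; every value here is invented.
toy <- tibble(
  PRICE      = c(565000, 365000, 287000),
  GARAGE     = c("2", "2", "1"),
  BUILD_YEAR = c("2003", "2013", "1979"),
  DATE_SOLD  = c("09-2018", "02-2019", "06-2019"),  # assumed "MM-YYYY" format
  POSTCODE   = c(6164, 6167, 6111)
)

toy_cln <-
  toy |>
  # Lowercase snake_case column names
  clean_names() |>
  # Character to numeric, and numeric to character for the postcode
  mutate(
    garage     = as.numeric(garage),
    build_year = as.numeric(build_year),
    postcode   = as.character(postcode)
  ) |>
  # Split the sale date into month and year (hypothetical column names)
  separate(date_sold, into = c("month_sold", "year_sold"), sep = "-", convert = TRUE) |>
  # Keep a single year of sales (2019 is an arbitrary choice for this sketch)
  filter(year_sold == 2019)

toy_cln
```

Treating the postcode as character reflects that it is a label rather than a quantity, so it should not enter a regression as a number.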

Clean column names

We convert the column names to lowercase to make them easier to work with in code.

house_prices_cln <- 
  house_prices |> 
  clean_names()

names(house_prices_cln)
##  [1] "address"          "suburb"           "price"            "bedrooms"        
##  [5] "bathrooms"        "garage"           "land_area"        "floor_area"      
##  [9] "build_year"       "cbd_dist"         "nearest_stn"      "nearest_stn_dist"
## [13] "date_sold"        "postcode"         "latitude"         "longitude"       
## [17] "nearest_sch"      "nearest_sch_dist" "nearest_sch_rank"

Change data types

We correct the data types of some of the fields. Before converting a column, we check its contents.
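
A count of distinct values prints only the first ten rows by default, so one possible complement is to list just the values that would fail numeric conversion. The sketch below runs on an invented toy column; the "NULL" string is a hypothetical bad value chosen for illustration, not something taken from the real data.

```r
library(dplyr)

# Hypothetical stand-in for a character column; "NULL" is an invented bad value.
toy_garage <- tibble(garage = c("1", "2", "10", "NULL", "2"))

# Keep the distinct non-missing values that as.numeric() cannot parse.
# suppressWarnings() hides the coercion warning that as.numeric() would raise.
non_numeric <-
  toy_garage |>
  distinct(garage) |>
  filter(!is.na(garage), is.na(suppressWarnings(as.numeric(garage))))

non_numeric
```

An empty result would indicate that every distinct value converts cleanly.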

garage

house_prices_cln |> 
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##    <chr>  <int>
##  1 1       5290
##  2 10        26
##  3 11         7
##  4 12        30
##  5 13         8
##  6 14        13
##  7 16         4
##  8 17         1
##  9 18         3
## 10 2      20724
## # ℹ 16 more rows

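The values shown are all legitimate numbers, so we convert this column to numeric. Note that as.numeric() turns any string that is not a valid number into NA, with a warning.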

house_prices_cln <- 
  house_prices_cln |> 
  mutate(garage = as.numeric(garage))

house_prices_cln |> 
  count(garage)
## # A tibble: 26 × 2
##    garage     n
##     <dbl> <int>
##  1      1  5290
##  2      2 20724
##  3      3  2042
##  4      4  1949
##  5      5   362
##  6      6   466
##  7      7    97
##  8      8   129
##  9      9    17
## 10     10    26
## # ℹ 16 more rows
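
The summaries above are truncated to ten rows. To inspect every distinct value, the tibble print method accepts an n argument. The following is a small sketch on invented data with 26 distinct values, mirroring the row count reported above; the toy values themselves are an assumption.

```r
library(dplyr)

# Invented toy column with 26 distinct values, enough for the default print to truncate.
toy <- tibble(garage = as.numeric(rep(1:26, times = 2)))

garage_counts <-
  toy |>
  count(garage)

# n = Inf asks the tibble print method to show every row instead of the first ten.
print(garage_counts, n = Inf)
```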